Compact In-Memory Models for Compression of Large Text Databases
Authors
Abstract
For compression of text databases, semi-static word-based models are a pragmatic choice. They provide good compression with a model of moderate size, and allow independent decompression of stored documents. Previous experiments have shown that, where there is not sufficient memory to store a full word-based model, encoding rare words as sequences of characters can still allow good compression, while a pure character-based model is poor. In addition, there are other kinds of semi-static model that can be used for text, such as word pairs. We propose a further kind of model that reduces the main-memory costs of a word-based model: approximate models, in which rare words are represented by similarly-spelt common words and a sequence of edits. We investigate the compression available with different memory-efficient models, including characters, words, word pairs, and edits, and with combinations of these approaches. We show experimentally that carefully chosen combinations of models can significantly improve the compression available in limited memory and greatly reduce overall memory requirements.
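The idea of an approximate model can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `encode_word` and `decode_word`, the token layout, and the use of Python's `difflib` with a similarity cutoff of 0.6 are all assumptions made for illustration. A word held in the in-memory model is emitted directly, a rare word is emitted as a similarly-spelt common word plus a list of edits, and a word with no close match falls back to its characters.

```python
from difflib import SequenceMatcher, get_close_matches


def encode_word(word, common_words, cutoff=0.6):
    """Encode one word against a small in-memory list of common words.

    Hypothetical token layout (an assumption, not taken from the paper):
      ("word", w)             w is in the model
      ("approx", base, edits) rare word = common word `base` + edits
      ("chars", w)            no similar common word; spell it out
    """
    if word in common_words:
        return ("word", word)
    candidates = get_close_matches(word, common_words, n=1, cutoff=cutoff)
    if not candidates:
        return ("chars", word)
    base = candidates[0]
    # Keep only the non-matching opcodes: each edit says "replace
    # base[i1:i2] with this text" (empty text means a deletion).
    edits = [
        (tag, i1, i2, word[j1:j2])
        for tag, i1, i2, j1, j2 in SequenceMatcher(None, base, word).get_opcodes()
        if tag != "equal"
    ]
    return ("approx", base, edits)


def decode_word(token):
    """Invert encode_word by replaying the edits over the base word."""
    if token[0] in ("word", "chars"):
        return token[1]
    _, base, edits = token
    pieces, pos = [], 0
    for _tag, i1, i2, replacement in edits:
        pieces.append(base[pos:i1])   # unchanged stretch of the base word
        pieces.append(replacement)    # inserted or substituted text
        pos = i2                      # skip the replaced/deleted stretch
    pieces.append(base[pos:])
    return "".join(pieces)


common = ["compression", "database", "memory"]
for w in ["memory", "compressions", "databases", "xyz"]:
    token = encode_word(w, common)
    assert decode_word(token) == w
```

In a real compressor the tokens would then be entropy coded; the sketch only shows how a rare word can be recovered from a common word and a short edit list, so that the rare word itself need not be stored in the model.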
Similar resources
Comparison among Suffix Array Construction Algorithms
A suffix array is a compact data structure for searching for matching strings in text databases. It is an array of pointers that stores all suffixes of a text in lexicographic order. Because its memory requirement is less than that of tree structures, it is effective for large databases. Moreover, suffix array construction is used in the Block Sorting compression scheme. We compare algorithms for constructing suffix a...
A limited memory adaptive trust-region approach for large-scale unconstrained optimization
This study is concerned with a trust-region-based method for solving unconstrained optimization problems. The approach takes advantage of the compact limited memory BFGS updating formula together with an appropriate adaptive radius strategy. In our approach, the adaptive technique decreases the number of subproblems to be solved, while utilizing the structure of limited memory quasi-Newt...
Combining Text Compression and String Matching: The Miracle of Self-Indexing
This decade has witnessed the rise of what I consider the most important breakthrough of modern times in text compression and indexed string matching. Self-indexing is the mechanism by which a text is simultaneously compressed and indexed, so that the self-index occupies space close to that of the compressed text, provides random access to any part of it, and in addition supports efficient inde...
Text Compression for Dynamic Document Databases
For compression of text databases, semi-static word-based methods provide good performance in terms of both speed and disk space, but two problems arise. First, the memory requirements for the compression model during decoding can be unacceptably high. Second, the need to handle document insertions means that the collection must be periodically recompressed, if compression efficiency is to be m...
Implementation of VLSI Based Image Compression Approach on Reconfigurable Computing System - A Survey
Image data require huge amounts of disk space and large bandwidths for transmission. Hence, image compression is necessary to reduce the amount of data required to represent a digital image. Therefore, an efficient technique for image compression is in high demand. Although many compression techniques are available, the technique which is faster, memory efficient and simple, surely...
Publication date: 1999